Sparse Gaussian Graphical Models with Discrete Optimization: Computational and Statistical Perspectives
Behdin, Kayhan, Chen, Wenyu, Mazumder, Rahul
We consider the problem of learning a sparse graph underlying an undirected Gaussian graphical model, a key problem in statistical machine learning. Given $n$ samples from a multivariate Gaussian distribution with $p$ variables, the goal is to estimate the $p \times p$ inverse covariance matrix (aka precision matrix), assuming it is sparse (i.e., has a few nonzero entries). We propose GraphL0BnB, a new estimator based on an $\ell_0$-penalized version of the pseudolikelihood function, while most earlier approaches are based on the $\ell_1$-relaxation. Our estimator can be formulated as a convex mixed integer program (MIP) which can be difficult to compute at scale using off-the-shelf commercial solvers. To solve the MIP, we propose a custom nonlinear branch-and-bound (BnB) framework that solves node relaxations with tailored first-order methods. As a by-product of our BnB framework, we propose large-scale solvers for obtaining good primal solutions that are of independent interest. We derive novel statistical guarantees (estimation and variable selection) for our estimator and discuss how our approach improves upon existing estimators. Our numerical experiments on real/synthetic datasets suggest that our method can solve, to near-optimality, problem instances with $p = 10^4$ -- corresponding to a symmetric matrix of size $p \times p$ with $p^2/2$ binary variables. We demonstrate the usefulness of GraphL0BnB versus various state-of-the-art approaches on a range of datasets.
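To make the form of the estimator concrete, here is a hedged sketch of one standard symmetric pseudolikelihood objective with an $\ell_0$ penalty (a CONCORD-style data-fit term); the notation ($X_j$ for the $j$-th column of the $n \times p$ data matrix, $\lambda_0$ for the penalty weight) is ours, and the exact GraphL0BnB objective may differ in details such as additional ridge regularization or a cardinality constraint.
\[ \hat{\Theta} \in \operatorname*{arg\,min}_{\Theta = \Theta^\top,\ \theta_{ii} > 0} \; -n \sum_{i=1}^{p} \log \theta_{ii} + \frac{1}{2} \sum_{i=1}^{p} \Big\| \sum_{j=1}^{p} \theta_{ij} X_j \Big\|_2^2 + \lambda_0 \sum_{i<j} \mathbb{1}\left[\theta_{ij} \neq 0\right] \]
Modeling each indicator with a binary variable $z_{ij} \in \{0,1\}$ linked to $\theta_{ij}$ through big-M or conic constraints is a standard way to arrive at a convex mixed integer program of the kind the abstract describes.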
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Research Report > Promising Solution (0.65)
- Research Report > New Finding (0.46)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Regression (0.46)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.46)
A Fourier Approach to Mixture Learning
Qiao, Mingda, Guruganesh, Guru, Rawat, Ankit Singh, Dubey, Avinava, Zaheer, Manzil
We revisit the problem of learning mixtures of spherical Gaussians. Given samples from mixture $\frac{1}{k}\sum_{j=1}^{k}\mathcal{N}(\mu_j, I_d)$, the goal is to estimate the means $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^d$ up to a small error. The hardness of this learning problem can be measured by the separation $\Delta$ defined as the minimum distance between all pairs of means. Regev and Vijayaraghavan (2017) showed that with $\Delta = \Omega(\sqrt{\log k})$ separation, the means can be learned using $\mathrm{poly}(k, d)$ samples, whereas super-polynomially many samples are required if $\Delta = o(\sqrt{\log k})$ and $d = \Omega(\log k)$. This leaves open the low-dimensional regime where $d = o(\log k)$. In this work, we give an algorithm that efficiently learns the means in $d = O(\log k/\log\log k)$ dimensions under separation $d/\sqrt{\log k}$ (modulo doubly logarithmic factors). This separation is strictly smaller than $\sqrt{\log k}$, and is also shown to be necessary. Along with the results of Regev and Vijayaraghavan (2017), our work almost pins down the critical separation threshold at which efficient parameter learning becomes possible for spherical Gaussian mixtures. More generally, our algorithm runs in time $\mathrm{poly}(k)\cdot f(d, \Delta, \epsilon)$, and is thus fixed-parameter tractable in parameters $d$, $\Delta$ and $\epsilon$. Our approach is based on estimating the Fourier transform of the mixture at carefully chosen frequencies, and both the algorithm and its analysis are simple and elementary. Our positive results can be easily extended to learning mixtures of non-Gaussian distributions, under a mild condition on the Fourier spectrum of the distribution.
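A minimal one-dimensional sketch of the core primitive in the abstract: estimating the Fourier transform (characteristic function) of the mixture from samples at chosen frequencies. The mixture, sample size, and frequency grid below are illustrative choices of ours, not the paper's actual schedule or algorithm.

import numpy as np

rng = np.random.default_rng(0)
k, n = 4, 5000
means = rng.normal(scale=3.0, size=k)        # unknown means mu_1, ..., mu_k
comp = rng.integers(k, size=n)               # uniform mixing weights 1/k
samples = means[comp] + rng.normal(size=n)   # N(mu_j, 1) samples

def empirical_cf(x, xi):
    """Estimate E[exp(i * xi * X)] at each frequency xi by averaging over the samples."""
    return np.mean(np.exp(1j * np.outer(xi, x)), axis=1)

xi = np.linspace(-5.0, 5.0, 101)             # illustrative frequency grid
cf_hat = empirical_cf(samples, xi)
# For a unit-variance mixture the true CF is exp(-xi^2/2) * (1/k) * sum_j exp(i*xi*mu_j);
# the means enter only through the oscillatory factor, which is what the algorithm
# inverts at carefully chosen frequencies.
cf_true = np.exp(-xi**2 / 2) * np.mean(np.exp(1j * np.outer(xi, means)), axis=1)
print(np.max(np.abs(cf_hat - cf_true)))      # small for large n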
On the Complexity of a Practical Primal-Dual Coordinate Method
Alacaoglu, Ahmet, Cevher, Volkan, Wright, Stephen J.
The various methods that have been proposed for (1.1) have favorable complexity guarantees in certain special cases. The plethora of methods and results makes it difficult for both theoreticians and practitioners to choose the method best suited to particular instances of (1.1). In this paper, we focus on improving the theory for an existing method, the PURE-CD algorithm described in [2]. We show that this method achieves or improves on the best-known complexity results for interesting special cases of (1.1). The state-of-the-art results are currently dispersed across different methods.
Layer-Peeled Model: Toward Understanding Well-Trained Deep Neural Networks
Fang, Cong, He, Hangfeng, Long, Qi, Su, Weijie J.
In this paper, we introduce the Layer-Peeled Model, a nonconvex yet analytically tractable optimization program, in a quest to better understand deep neural networks that are trained for a sufficiently long time. As the name suggests, this new model is derived by isolating the topmost layer from the remainder of the neural network, followed by imposing certain constraints separately on the two parts. We demonstrate that the Layer-Peeled Model, albeit simple, inherits many characteristics of well-trained neural networks, thereby offering an effective tool for explaining and predicting common empirical patterns of deep learning training. First, when working on class-balanced datasets, we prove that any solution to this model forms a simplex equiangular tight frame, which in part explains the recently discovered phenomenon of neural collapse in deep learning training [PHD20]. Moreover, when moving to the imbalanced case, our analysis of the Layer-Peeled Model reveals a hitherto unknown phenomenon that we term Minority Collapse, which fundamentally limits the performance of deep learning models on the minority classes. In addition, we use the Layer-Peeled Model to gain insights into how to mitigate Minority Collapse. Interestingly, this phenomenon is first predicted by the Layer-Peeled Model before its confirmation by our computational experiments.
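As a quick illustration of the geometry the abstract refers to, the short sketch below constructs a $K$-class simplex equiangular tight frame and checks its defining property (unit-norm vectors whose pairwise inner products all equal $-1/(K-1)$); the choice $K = 10$ is illustrative.

import numpy as np

K = 10
M = np.sqrt(K / (K - 1)) * (np.eye(K) - np.ones((K, K)) / K)  # columns are the ETF vectors
G = M.T @ M                                                   # Gram matrix
print(np.allclose(np.diag(G), 1.0))                           # unit norms
off_diag = G[~np.eye(K, dtype=bool)]
print(np.allclose(off_diag, -1.0 / (K - 1)))                  # equal pairwise angles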
- North America > United States > Pennsylvania (0.04)
- Asia > Middle East > Jordan (0.04)
Feature Purification: How Adversarial Training Performs Robust Deep Learning
Allen-Zhu, Zeyuan, Li, Yuanzhi
Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, it remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show that one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to purify hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation. (1) Training over the original data is indeed non-robust to small adversarial perturbations of some radius. (2) Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low-complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
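For concreteness, here is a minimal sketch of one FGM perturbation step, the kind of empirical attack the abstract mentions inside adversarial training; the two-layer ReLU network, $\ell_2$ radius, and synthetic batch are illustrative placeholders of ours, not the paper's experimental setup.

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
radius = 0.1

x_adv = x.clone().requires_grad_(True)
loss = loss_fn(net(x_adv), y)
grad, = torch.autograd.grad(loss, x_adv)
# FGM: move each example by a fixed l2 radius along its gradient direction.
delta = radius * grad / (grad.norm(dim=1, keepdim=True) + 1e-12)
x_adv = (x + delta).detach()
# Adversarial training would now take an optimizer step on loss_fn(net(x_adv), y).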
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > Canada > Quebec > Montreal (0.04)
Model Repair: Robust Recovery of Over-Parameterized Statistical Models
Traditional robust estimation assumes that the data are corrupted, and studies methods of estimation that are immune to these corruptions or outliers in the data. In contrast, we explore the setting where the data are "clean" but a statistical model is corrupted after it has been estimated using the data. We study methods for recovering the model that do not require re-estimation from scratch, using only the design and not the original response values. The problem of model repair is motivated from several different perspectives. First, it can be formulated as a well-defined statistical problem that is closely related to, but different from, traditional robust estimation, and that deserves study in its own right. From a more practical perspective, modern machine learning practice is increasingly working with very large statistical models. For example, artificial neural networks having several million parameters are now routinely estimated. It is anticipated that neural networks having trillions of parameters will be built in the coming years, and that large models will be increasingly embedded in systems, where they may be subject to errors and corruption of the parameter values. In this setting, the maintenance of models in a fault tolerant manner becomes a concern.
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
Learning Mixtures of Linear Regressions in Subexponential Time via Fourier Moments
Chen, Sitan, Li, Jerry, Song, Zhao
We consider the problem of learning a mixture of linear regressions (MLRs). An MLR is specified by $k$ nonnegative mixing weights $p_1, \ldots, p_k$ summing to $1$, and $k$ unknown regressors $w_1,...,w_k\in\mathbb{R}^d$. A sample from the MLR is drawn by sampling $i$ with probability $p_i$, then outputting $(x, y)$ where $y = \langle x, w_i \rangle + \eta$, where $\eta\sim\mathcal{N}(0,\varsigma^2)$ for noise rate $\varsigma$. Mixtures of linear regressions are a popular generative model and have been studied extensively in machine learning and theoretical computer science. However, all previous algorithms for learning the parameters of an MLR require running time and sample complexity scaling exponentially with $k$. In this paper, we give the first algorithm for learning an MLR that runs in time which is sub-exponential in $k$. Specifically, we give an algorithm which runs in time $\widetilde{O}(d)\cdot\exp(\widetilde{O}(\sqrt{k}))$ and outputs the parameters of the MLR to high accuracy, even in the presence of nontrivial regression noise. We demonstrate a new method that we call "Fourier moment descent" which uses univariate density estimation and low-degree moments of the Fourier transform of suitable univariate projections of the MLR to iteratively refine our estimate of the parameters. To the best of our knowledge, these techniques have never been used in the context of high dimensional distribution learning, and may be of independent interest. We also show that our techniques can be used to give a sub-exponential time algorithm for learning mixtures of hyperplanes, a natural hard instance of the subspace clustering problem.
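A minimal sketch of the generative model described in the abstract: sample component $i$ with probability $p_i$, then output $(x, y)$ with $y = \langle x, w_i \rangle + \eta$. The dimensions, regressors, and noise rate below are illustrative, and the learning algorithm itself (Fourier moment descent) is not shown.

import numpy as np

rng = np.random.default_rng(1)
k, d, n = 3, 5, 1000
p = np.array([0.5, 0.3, 0.2])                # nonnegative mixing weights summing to 1
W = rng.normal(size=(k, d))                  # unknown regressors w_1, ..., w_k
varsigma = 0.1                               # regression noise rate

comp = rng.choice(k, size=n, p=p)            # latent component labels
X = rng.normal(size=(n, d))                  # covariates
y = np.einsum('nd,nd->n', X, W[comp]) + varsigma * rng.normal(size=n)
# A learner observes only (X, y) and must recover p and W up to small error.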
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.13)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Europe > Romania > Sud-Est Development Region > Constanța County > Constanța (0.04)
- Overview (0.67)
- Research Report (0.50)
D-SPIDER-SFO: A Decentralized Optimization Algorithm with Faster Convergence Rate for Nonconvex Problems
Pan, Taoxing, Liu, Jun, Wang, Jie
Decentralized optimization algorithms have attracted intense interest recently, as they have a balanced communication pattern, especially when solving large-scale machine learning problems. The Stochastic Path Integrated Differential Estimator Stochastic First-Order method (SPIDER-SFO) nearly achieves the algorithmic lower bound in certain regimes for nonconvex problems. However, whether we can find a decentralized algorithm that achieves a convergence rate similar to SPIDER-SFO is still unclear. To tackle this problem, we propose a decentralized variant of SPIDER-SFO, called decentralized SPIDER-SFO (D-SPIDER-SFO). We show that D-SPIDER-SFO achieves a gradient computation cost similar to its centralized counterpart -- that is, $O(\epsilon^{-3})$ for finding an $\epsilon$-approximate first-order stationary point. To the best of our knowledge, D-SPIDER-SFO achieves state-of-the-art performance for solving nonconvex optimization problems on decentralized networks in terms of computational cost. Experiments on different network configurations demonstrate the efficiency of the proposed method.
Introduction. Distributed optimization is a popular technique for solving large-scale machine learning problems Li et al. (2014), ranging from visual object recognition Huang et al. (2017); He et al. (2016) to natural language processing Vaswani et al. (2017); Devlin et al. (2019). For distributed optimization, a set of workers forms a connected computational network, and each worker is assigned a portion of the computing task. The centralized network topology, such as a parameter server Jianmin et al. (2016); Dean et al. (2012); Li et al. (2014); Zinkevich et al. (2010), consists of a central worker connected with all other workers.
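For context, the sketch below shows the centralized SPIDER variance-reduced gradient estimator that D-SPIDER-SFO builds on, applied to a toy least-squares problem; the objective, step size, batch size, and refresh period are illustrative, and the decentralized algorithm additionally averages iterates/estimators with network neighbors via a gossip (mixing) matrix, which is not shown here.

import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 20
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def grad(x, idx):
    """Minibatch gradient of f(x) = (1/2n) * ||Ax - b||^2 over the rows in idx."""
    Ai = A[idx]
    return Ai.T @ (Ai @ x - b[idx]) / len(idx)

x_prev = x = np.zeros(d)
v = grad(x, np.arange(n))                     # large-batch initialization of the estimator
eta, q, batch = 0.05, 50, 32
for t in range(1, 500):
    if t % q == 0:
        v = grad(x, np.arange(n))             # periodic refresh
    else:
        idx = rng.integers(n, size=batch)
        v = grad(x, idx) - grad(x_prev, idx) + v   # path-integrated correction
    x_prev, x = x, x - eta * v
print(np.linalg.norm(grad(x, np.arange(n))))  # small => near-stationary point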
- Asia > Middle East > Jordan (0.04)
- Asia > China (0.04)
- Education > Focused Education > Special Education (0.44)
- Information Technology (0.34)
Generalization Bounds for Neural Networks via Approximate Description Length
We investigate the sample complexity of networks with bounds on the magnitude of their weights. In particular, we consider the class \[ H=\left\{W_t\circ\rho\circ \ldots\circ\rho\circ W_{1} :W_1,\ldots,W_{t-1}\in M_{d, d}, W_t\in M_{1,d}\right\} \] where the spectral norm of each $W_i$ is bounded by $O(1)$, the Frobenius norm is bounded by $R$, and $\rho$ is the sigmoid function $\frac{e^x}{1+e^x}$ or the smoothened ReLU function $\ln(1+e^x)$. We show that for any depth $t$, if the inputs are in $[-1,1]^d$, the sample complexity of $H$ is $\tilde O\left(\frac{dR^2}{\epsilon^2}\right)$. This bound is optimal up to log-factors, and substantially improves over the previous state of the art of $\tilde O\left(\frac{d^2R^2}{\epsilon^2}\right)$. We furthermore show that this bound remains valid if instead of considering the magnitude of the $W_i$'s, we consider the magnitude of $W_i - W_i^0$, where $W_i^0$ are some reference matrices, with spectral norm of $O(1)$. By taking the $W_i^0$ to be the matrices at the onset of the training process, we get sample complexity bounds that are sub-linear in the number of parameters, in many typical regimes of parameters. To establish our results, we develop a new technique to analyze the sample complexity of families $H$ of predictors. We start by defining a new notion of a randomized approximate description of functions $f:X\to\mathbb{R}^d$. We then show that if there is a way to approximately describe functions in a class $H$ using $d$ bits, then $d/\epsilon^2$ examples suffice to guarantee uniform convergence, namely, that the empirical loss of all the functions in the class is $\epsilon$-close to the true loss. Finally, we develop a set of tools for calculating the approximate description length of classes of functions that can be presented as a composition of linear function classes and non-linear functions.
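A small sketch of a member of the class $H$ from the abstract, using the smoothed ReLU $\rho(x) = \ln(1+e^x)$ and weight matrices rescaled so that spectral norms stay $O(1)$ and Frobenius norms stay at most $R$; the depth, width, and value of $R$ below are illustrative choices of ours.

import numpy as np

rng = np.random.default_rng(0)
d, t, R = 100, 3, 5.0

def smoothed_relu(x):
    return np.log1p(np.exp(x))

def capped(W):
    """Rescale W so its Frobenius norm is at most R (its spectral norm can only shrink)."""
    return W * min(1.0, R / np.linalg.norm(W, 'fro'))

layers = [capped(rng.normal(size=(d, d)) / np.sqrt(d)) for _ in range(t - 1)]  # W_1, ..., W_{t-1}
W_t = capped(rng.normal(size=(1, d)) / np.sqrt(d))                             # top layer W_t

def predict(x):
    for W in layers:
        x = smoothed_relu(W @ x)
    return (W_t @ x).item()

x = rng.uniform(-1.0, 1.0, size=d)            # inputs lie in [-1, 1]^d
print(predict(x))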
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Middle East > Israel > Jerusalem District > Jerusalem (0.04)
- Asia > Middle East > Israel > Tel Aviv District > Tel Aviv (0.04)
Mildly Overparametrized Neural Nets can Memorize Training Data Efficiently
Ge, Rong, Wang, Runzhe, Zhao, Haoyu
However, it was observed that neural networks are able to fit the training data perfectly, even when the data/labels are randomly corrupted (Zhang et al., 2017). Recently, a series of works (Du et al. (2019); Allen-Zhu et al. (2019c); Chizat and Bach (2018); Jacot et al. (2018), see more references in Section 1.2) developed a theory of neural tangent kernels (NTK) that explains the success of training neural networks through overparametrization. Several results showed that if the number of neurons at each layer is much larger than the number of training samples, networks of different architectures (multilayer/recurrent) can all fit the training data perfectly. However, if one considers the number of parameters required by the current theoretical analysis, these networks are highly overparametrized. Consider fully connected networks, for example. If a two-layer network has a hidden layer with r neurons, the number of parameters is at least rd, where d is the dimension of the input.
- North America (0.14)
- Asia > Middle East > Jordan (0.04)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)